{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# aitextgen Training Hello World\n",
    "\n",
    "_Last Updated: Feb 21, 2021 (v.0.4.0)_\n",
    "\n",
    "by Max Woolf\n",
    "\n",
    "A \"Hello World\" Tutorial to show how training works with aitextgen, even on a CPU!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from aitextgen.TokenDataset import TokenDataset\n",
    "from aitextgen.tokenizers import train_tokenizer\n",
    "from aitextgen.utils import GPT2ConfigCPU\n",
    "from aitextgen import aitextgen"
   ]
  },
  {
   "source": [
    "First, download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) to the folder containing this notebook, then put the name of the downloaded file in the cell below."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_name = \"input.txt\""
   ]
  },
  {
   "source": [
    "You can now train a custom Byte Pair Encoding Tokenizer on the downloaded text!\n",
    "\n",
    "This will save one file: `aitextgen.tokenizer.json`, which contains the information needed to rebuild the tokenizer."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_tokenizer(file_name)\n",
    "tokenizer_file = \"aitextgen.tokenizer.json\""
   ]
  },
  {
   "source": [
    "`GPT2ConfigCPU()` is a mini variant of GPT-2 optimized for CPU training.\n",
    "\n",
    "For example, the number of input tokens here is 64 vs. 1024 for base GPT-2, which dramatically speeds up training."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = GPT2ConfigCPU()"
   ]
  },
  {
   "source": [
    "Instantiate aitextgen using the created tokenizer and config."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "ai = aitextgen(tokenizer_file=tokenizer_file, config=config)"
   ]
  },
  {
   "source": [
    "You can build datasets for training by creating a `TokenDataset`, which automatically processes the dataset with the appropriate size."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "100%|██████████| 40000/40000 [00:00<00:00, 86712.61it/s]\n"
     ]
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "TokenDataset containing 462,820 subsets loaded from file at input.txt."
      ]
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)\n",
    "data"
   ]
  },
  {
   "source": [
    "Train the model! It will save `pytorch_model.bin` periodically and after completion to the `trained_model` folder. On a 2020 8-core iMac, this took ~25 minutes to run.\n",
    "\n",
    "The configuration below processes 400,000 subsets of tokens (8 * 50000), which is slightly less than one pass through all 462,820 subsets in the data (1 epoch). Ideally, you'll want multiple passes through the data and a training loss below `2.0` for coherent output; that's more difficult when training a model from scratch, but with long enough training you can get there!"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "pytorch_model.bin already exists in /trained_model and will be overwritten!\n",
      "GPU available: False, used: False\n",
      "TPU available: None, using: 0 TPU cores\n",
      "\u001b[1m5,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m5,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      "'s dead;\n",
      "But is no winted in his northeritiff\n",
      "Tave passage, and eleve your hours.\n",
      "\n",
      "PETRUCHIO:\n",
      "What is this I does, I will, sir;\n",
      "That, you have, nor tolding we\n",
      "==========\n",
      "\u001b[1m10,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m10,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      ".\n",
      "\n",
      "QUEEN ELIZABETH:\n",
      "I know, to, fair beat, to my soul is wonder'd intend.\n",
      "\n",
      "KING RICHARD III:\n",
      "Hold, and threaten, my lord, and my shame!\n",
      "\n",
      "QUEEN ELIZAB\n",
      "==========\n",
      "\u001b[1m15,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m15,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      "s of capitcts!\n",
      "\n",
      "EDWARD:\n",
      "Gardener, what is this hour will not say.\n",
      "What, shall the joint, I pray, if they\n",
      "Harry, let bid me as he would readness so.\n",
      "\n",
      "B\n",
      "==========\n",
      "\u001b[1m20,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m20,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      " for.\n",
      "\n",
      "ROMEO:\n",
      "Fair to the iercing wide's fretch,\n",
      "And happy talk of the master,\n",
      "And waste their justice with the feet and punning,\n",
      "And therefore be ben\n",
      "==========\n",
      "\u001b[1m25,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m25,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      ",\n",
      "That we we will have not lose such.\n",
      "\n",
      "See, to the kingdom of our virtue,\n",
      "You banish'd our purpose, for our own ignorse,\n",
      "Dispon I remain, and seem'd in\n",
      "==========\n",
      "\u001b[1m30,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m30,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      ".\n",
      "\n",
      "BENVOLIO:\n",
      "O, she's dead!\n",
      "\n",
      "CAMILLO:\n",
      "No, my lord;\n",
      "These accession will be hous.\n",
      "\n",
      "DERBY:\n",
      "No, my lord.\n",
      "\n",
      "GLOUCESTER:\n",
      "What is the\n",
      "==========\n",
      "\u001b[1m35,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m35,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      ",\n",
      "And whiles it is but the castle,\n",
      "That stavin'd in the gods of men.\n",
      "\n",
      "COMFEY:\n",
      "What, then?\n",
      "\n",
      "ELBOW:\n",
      "Peace, my lord,\n",
      "And weat your greats\n",
      "==========\n",
      "\u001b[1m40,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m40,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      "\n",
      "The white mercy of the sun upon my past,\n",
      "Of my father's son be first, thy sake,\n",
      "His son's chief son, and my includy;\n",
      "And if thy brother's loss, thy thrief,\n",
      "\n",
      "==========\n",
      "\u001b[1m45,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m45,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      " to the crown,\n",
      "Or I'll privy I have.\n",
      "\n",
      "POLIXENES:\n",
      "I have been a stir.\n",
      "\n",
      "LEONTES:\n",
      "The worshiped, the benefition of the crown.\n",
      "\n",
      "His somet\n",
      "==========\n",
      "\u001b[1m50,000 steps reached: saving model to /trained_model\u001b[0m\n",
      "\u001b[1m50,000 steps reached: generating sample texts.\u001b[0m\n",
      "==========\n",
      ":\n",
      "Catesby, girls, and make avoides;\n",
      "But, welcome a far\n",
      "That ever home, like a villain, and behold\n",
      "Canusy not passing nonquial at the g\n",
      "==========\n",
      "Loss: 2.940 — Avg: 2.884: 100%|██████████| 50000/50000 [31:39<00:00, 26.32it/s]\n"
     ]
    }
   ],
   "source": [
    "ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)"
   ]
  },
  {
   "source": [
    "Generate text from your trained model!"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "\u001b[1mROMEO:\u001b[0m\nAbook, ho! forthing me, gentle Earl's royal king,\nAnd this, I, with that I do not beseech you\nTo visit the battle, that I should believe you,\nWhich I would never\n==========\n\u001b[1mROMEO:\u001b[0m\nConfound is gone, thou art a maid into the widow;\nPut up my life and make me no harmony\nAnd make thee I know uncle,\nUnconted and curses: therefore in my\n==========\n\u001b[1mROMEO:\u001b[0m\nGod push! but what days to see\nThe giving bleedom's heart I do? Therefore,\nAnd most unless I had rather. He saddle\nTake your cold shack down; and so far I\n==========\n\u001b[1mROMEO:\u001b[0m\nPersetain'd up the earth of mercy,\nAnd never yet, the sun to make him all the\nMore than my battle.\n\nROMEO:\nI warrant him, to know, we'll not do't, but hate me\n==========\n\u001b[1mROMEO:\u001b[0m\nMethinks I am a mile, and trench one\nThy winded makes, in faults and cast\nWith one to meether, of twenty days,\nThat in my waters, that f\n==========\n\u001b[1mROMEO:\u001b[0m\nO, here is such a woman guilty.\n\nROMEO:\nI do not think it; I should be renowned\nThat I am in that which can controy\nA bawd I take it to the purpose.\n\nJU\n==========\n\u001b[1mROMEO:\u001b[0m\nI know not what I am.\n\nFLORIZEL:\nAy, as I did,\nI would be adverpite of the homely treason\nFrom the doubled in the farm of his bed.\nTa\n==========\n\u001b[1mROMEO:\u001b[0m\nI pray you, he would have taken to him but,\nAnd freely mark his into a fine of it,\nSpeak to the second to our cheek;\nAnd every day, and sanctious cover\n==========\n\u001b[1mROMEO:\u001b[0m\nI had left me--born to be drawn.\n\nJULIET:\nMy husbour, I will have thee here:\nAnd, I have found to seek thyself.\n\nJULIET:\nI will be not b\n==========\n\u001b[1mROMEO:\u001b[0m\nThat is a hour,\nThe castard is, I'll not buy, or indeeding.\n\nNurse:\nLADY CAPULET:\nThe matter, that ta'en as I may find thee.\n\n"
     ]
    }
   ],
   "source": [
    "ai.generate(10, prompt=\"ROMEO:\")"
   ]
  },
  {
   "source": [
    "You can reload your trained model at any time by providing the `pytorch_model.bin` model weights, the `config`, and the `tokenizer`."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "ai2 = aitextgen(model_folder=\"trained_model\",\n",
    "                tokenizer_file=\"aitextgen.tokenizer.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "\u001b[1mROMEO:\u001b[0m\nBoy, unreacher, unhallupony, in Padua,\nUntimely fall till I be learn'd.\n\nROMEO:\nFie, good friar, be quick, for I am,\nI'll\n==========\n\u001b[1mROMEO:\u001b[0m\nI'll be plain, I am a tail of blessed wounds;\nFor I am dead, I have not borne to make\nA couple of her fortune, but that I'll bear,\nAnd say 'Ay, chur\n==========\n\u001b[1mROMEO:\u001b[0m\nAnd yet I am a resolution of my dear dear:\nIf I have not reason to do me say\nI'll deny the sea of my body to answer,\nAnd all thy tale, or I have my m\n==========\n\u001b[1mROMEO:\u001b[0m\nIntenty to a bawd of my bait,--\n\nJULIET:\nNo, I hope to know the title,\nFor that I wish her place.\n\nJULIET:\nDo I assure her?\n==========\n\u001b[1mROMEO:\u001b[0m\nO, what's the parle that I chide thee,\nThat honourable may be, that I have still'd thee:\nI pray thee, my lord.\n\nMERCUTIO:\nI', my lord.\n\nROMEO:\nHere is a\n==========\n\u001b[1mROMEO:\u001b[0m\nAnd, for I am, and not talk of that?\n\nROMEO:\nWhere's my child, I would guess thee here.\n\nJULIET:\nNay, boy, I'll not be bowling why I;\nO thou\n==========\n\u001b[1mROMEO:\u001b[0m\nO, but thou hast seen thee of mine own.\n\nROMEO:\nI would assist thee--\n\nJULIET:\nAy, it is, and not so.\n\nROMEO:\nNo, but that I must told me with it.\n\nROMEO\n==========\n\u001b[1mROMEO:\u001b[0m\nNo, no, nor I am. I am content.\n\nBENVOLIO:\nI will not, sir: but I have required\nAs I am grown in the lawful virtue\nThat it hath bid you think, and I\n==========\n\u001b[1mROMEO:\u001b[0m\nThat I should pardon, I would be gone.\n\nESCALUS:\nI should believe you, sir, sir, ay, I would not\nnot know more, but that I can, but I would have savour me.\n\nP\n==========\n\u001b[1mROMEO:\u001b[0m\nAnd thou, I will find out thy life the wind of love.\n\nROMEO:\nIt is the morning groom of it.\n\nJULIET:\nFie, good sweet boy, I will take my leave to a happy day,\n"
     ]
    }
   ],
   "source": [
    "ai2.generate(10, prompt=\"ROMEO:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MIT License\n",
    "\n",
    "Copyright (c) 2021 Max Woolf\n",
    "\n",
    "Permission is hereby granted, free of charge, to any person obtaining a copy\n",
    "of this software and associated documentation files (the \"Software\"), to deal\n",
    "in the Software without restriction, including without limitation the rights\n",
    "to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n",
    "copies of the Software, and to permit persons to whom the Software is\n",
    "furnished to do so, subject to the following conditions:\n",
    "\n",
    "The above copyright notice and this permission notice shall be included in all\n",
    "copies or substantial portions of the Software.\n",
    "\n",
    "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n",
    "IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n",
    "FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n",
    "AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n",
    "LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n",
    "OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n",
    "SOFTWARE."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.9.1 64-bit",
   "metadata": {
    "interpreter": {
     "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
    }
   }
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}